
Evaluation Metrics

1. Text-to-Video Retrieval Recall

The text-to-video retrieval task aims to establish semantic associations between text and video, enabling bidirectional retrieval. To evaluate a retrieval system comprehensively, a series of recall-based metrics is typically used, including Top-K recall (R@K) in both the video-to-text (V2T) and text-to-video (T2V) directions, as well as Mean Recall, which provides an overall measure of retrieval performance. These metrics reflect the model’s performance under different levels of retrieval difficulty: Top-1 indicates precise matching ability, while Top-5/10 indicate fault tolerance. They also help researchers evaluate a model’s strengths and weaknesses from multiple perspectives, for example whether it can still reliably return correct results when retrieval targets are semantically similar but subtly different, or whether it can still include the corresponding video or text within the top K results when the video description is vague or incomplete, thereby demonstrating the model’s robustness and generalization ability.

1.1 Recall (R@K):

$$\mathrm{R@}K = \frac{1}{N}\sum_{i=1}^{N}\mathbb{I}(\mathrm{rank}_i \le K)$$

Here, $N$ is the number of test samples, $\mathrm{rank}_i$ denotes the rank of the $i$-th positive sample in the retrieval results, and $\mathbb{I}(\cdot)$ is the indicator function. This metric measures the proportion of correct matches found within the given Top-K retrieval range. A higher R@K indicates that the model can include more true matches within the top K returned results, directly reflecting the effectiveness of the retrieval system.
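As a minimal sketch of this formula, the following Python snippet computes R@K from the (1-indexed) rank at which each query's positive item appears; the example ranks are hypothetical:

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose positive item appears within the top-k results.

    ranks: 1-indexed rank of each query's positive item in the retrieval list.
    """
    ranks = np.asarray(ranks)
    # I(rank_i <= K) averaged over the N queries
    return float(np.mean(ranks <= k))

# Hypothetical ranks for 5 queries: positives found at these positions.
ranks = [1, 3, 7, 2, 12]
print(recall_at_k(ranks, 1))   # 0.2 (only one query ranked first)
print(recall_at_k(ranks, 5))   # 0.6
print(recall_at_k(ranks, 10))  # 0.8
```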

1.2 Video-to-Text Recall (V2T R@K):

$$\mathrm{V2T@}K = \frac{1}{M}\sum_{j=1}^{M}\max_{l \in L_j}\mathbb{I}(\mathrm{rank}_{jl} \le K)$$

Here, $M$ is the number of videos, and $L_j$ is the set of positive text samples for the $j$-th video. This metric measures the ability to retrieve the corresponding text from a given video, reflecting the model’s accuracy and robustness in processing video features, extracting semantic information, and matching it with text descriptions.

1.3 Text-to-Video Recall (T2V R@K):

$$\mathrm{T2V@}K = \frac{1}{L}\sum_{k=1}^{L}\max_{v \in V_k}\mathbb{I}(\mathrm{rank}_{kv} \le K)$$

Here, $L$ is the number of text samples, and $V_k$ is the set of positive video samples for the $k$-th text. This metric measures the ability to retrieve the corresponding video from a given text, reflecting whether the model can successfully find the corresponding video content based on text descriptions of varying forms, lengths, and levels of detail.
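The max over positive sets in the two formulas above means a query counts as a hit if any of its positives lands in the top K. A sketch of that computation, assuming a hypothetical similarity matrix between queries and gallery items:

```python
import numpy as np

def recall_at_k_multi(sim, positives, k):
    """Top-k recall when each query may have several positive targets.

    sim:       (num_queries, num_gallery) similarity scores.
    positives: list of sets; positives[q] holds the gallery indices that are
               correct matches for query q.
    A query is a hit if any of its positives ranks within the top-k,
    i.e. the max over the positive set in the V2T/T2V formulas.
    """
    order = np.argsort(-sim, axis=1)      # gallery indices, best score first
    hits = 0
    for q, pos in enumerate(positives):
        topk = set(order[q, :k].tolist())
        if topk & pos:                    # at least one positive in the top-k
            hits += 1
    return hits / len(positives)

# Toy V2T example: 2 videos (queries) scored against 4 captions (gallery).
sim_v2t = np.array([[0.9, 0.2, 0.8, 0.1],
                    [0.3, 0.7, 0.4, 0.6]])
pos_v2t = [{0, 2}, {1, 3}]                # each video has two positive captions
print(recall_at_k_multi(sim_v2t, pos_v2t, 1))  # 1.0
```

T2V is the same computation run on the transposed score matrix, with the positive sets inverted (captions as queries, videos as the gallery).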

1.4 Mean Recall:

$$\mathrm{Mean\ Recall} = \frac{1}{6}\sum_{K \in \{1,5,10\}}\left(\mathrm{V2T@}K + \mathrm{T2V@}K\right)$$

Mean Recall aggregates the six Top-K recall values from the V2T and T2V directions into a single comprehensive metric, providing an overall evaluation of the retrieval system’s average performance across directions and difficulty levels. A higher Mean Recall indicates that the model performs well across all retrieval dimensions, excelling not just in a single retrieval direction but across diverse matching scenarios.
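The aggregation itself is a simple average of six numbers. A sketch, with hypothetical per-direction recall scores used purely for illustration:

```python
def mean_recall(v2t, t2v):
    """Average the six Top-K recalls (K = 1, 5, 10) from both directions.

    v2t, t2v: dicts mapping K -> recall, e.g. {1: 0.42, 5: 0.70, 10: 0.81}.
    """
    ks = (1, 5, 10)
    return sum(v2t[k] + t2v[k] for k in ks) / 6

# Hypothetical scores for illustration:
v2t = {1: 0.44, 5: 0.72, 10: 0.82}
t2v = {1: 0.42, 5: 0.70, 10: 0.80}
print(round(mean_recall(v2t, t2v), 3))  # 0.65
```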

These metrics, through multi-granularity evaluation (K = 1, 5, 10), comprehensively reflect the retrieval system’s precise matching ability and fault tolerance:

  • R@1 measures strict matching accuracy and reflects whether the system can directly find the correct result in the first position
  • R@5 / R@10 reflect the system’s robustness when the retrieval range is relaxed, indicating whether the model can still cover the correct match under less strict conditions
  • Mean Recall provides a single metric for overall performance, allowing researchers to quickly compare the comprehensive performance of different models or configurations

These metrics are reliable, widely adopted, and have been applied extensively in evaluating models on mainstream video retrieval benchmarks such as MSR‑VTT and ActivityNet.

1.5 Code

Retrieval metric calculation code: retrieval_evaluator